Marketing Analytics Process

Let’s Get Set Up With R/RStudio

R/RStudio Orientation

  • Console: Where you run code and perform calculations.
5 + 5
  • Source: Create and save R scripts and send code to the Console.
# Use comments to explain (the why of) your code.
print(2 * (18 - 7))
  • Environment: A snapshot of what you have loaded.
  • Help: Look up documentation including cheatsheets.
  • Files: The working directory for your project.

RStudio Tools > Global Options

  • General > Save workspace to .RData on exit: Never.
  • Code > Editing > Execution > Ctrl + Enter executes: Multiple consecutive R lines.
  • Code > Editing > Use native pipe operator.
  • Code > Display > General > Highlight selected line.
  • Code > Display > General > Rainbow parentheses.
  • Code > Saving > Automatically save when editor loses focus.
  • Appearance.

Functions

We import, wrangle, visualize, and model data using functions. Functions are composed of arguments that tell the function how to operate.

  • Using a function is referred to as a “call” or a “function call.”
  • Don’t forget you have Help.

The tidyverse

The tidyverse is a collection of R packages that share common philosophies and are designed to work together. An R package is a collection of functions, documentation, and sometimes data.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Import CSV Files

customer_data <- read_csv("customer_data.csv")
## Rows: 10531 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): gender, married, college_degree, region, state, review_time, review...
## dbl (6): customer_id, birth_year, income, credit, review_id, star_rating
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

  • Note that customer_data now appears in our Environment.
  • The Environment lists any objects (i.e., data or custom functions) we’ve assigned a name using <-.
  • We can also view objects in their own tab.

Data Frames

customer_data
## # A tibble: 10,531 × 14
##    customer_id birth_year gender income credit married college_degree region   
##          <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>    
##  1        1001       1971 Female  73000   742. No      No             South    
##  2        1002       1970 Female  31000   749. Yes     No             West     
##  3        1003       1988 Male    35000   542. No      No             South    
##  4        1004       1984 Other   64000   574. Yes     Yes            Midwest  
##  5        1005       1987 Male    58000   644. No      Yes            West     
##  6        1006       1994 Male   164000   554. Yes     Yes            Midwest  
##  7        1007       1968 Male    39000   608. No      No             Midwest  
##  8        1008       1994 Male    69000   710. No      No             South    
##  9        1009       1958 Male   233000   702. No      No             West     
## 10        1010       1994 Female  77000   605. Yes     No             Northeast
## # ℹ 10,521 more rows
## # ℹ 6 more variables: state <chr>, review_id <dbl>, star_rating <dbl>,
## #   review_time <chr>, review_title <chr>, review_text <chr>

glimpse(customer_data)
## Rows: 10,531
## Columns: 14
## $ customer_id    <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1…
## $ birth_year     <dbl> 1971, 1970, 1988, 1984, 1987, 1994, 1968, 1994, 1958, 1…
## $ gender         <chr> "Female", "Female", "Male", "Other", "Male", "Male", "M…
## $ income         <dbl> 73000, 31000, 35000, 64000, 58000, 164000, 39000, 69000…
## $ credit         <dbl> 742.0827, 749.3514, 542.2399, 573.9358, 644.2439, 553.6…
## $ married        <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "No"…
## $ college_degree <chr> "No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No"…
## $ region         <chr> "South", "West", "South", "Midwest", "West", "Midwest",…
## $ state          <chr> "DC", "WA", "AR", "MN", "HI", "MN", "MN", "KY", "NM", "…
## $ review_id      <dbl> 933551, NA, NA, NA, 501318, 109125, NA, 1959683, NA, 19…
## $ star_rating    <dbl> 4, NA, NA, NA, 5, 2, NA, 5, NA, 5, NA, 5, NA, NA, NA, N…
## $ review_time    <chr> "06 11, 2015", NA, NA, NA, "03 25, 2008", "06 7, 2013",…
## $ review_title   <chr> "Four Stars", NA, NA, NA, "Great Product!!", "Not at al…
## $ review_text    <chr> "everything's fine", NA, NA, NA, "I looked all over the…

Transform Data

The heart of data wrangling is transforming data.

  • Filter observations.
  • Arrange observations.
  • Slice observations.
  • Select variables.
  • Mutate variables (i.e., recode or create new variables).
  • Join data frames.

{dplyr} provides a consistent grammar for transforming data with functions (a.k.a., verbs) that mirror SQL.

Filter Observations

We often want to filter our data to keep certain observations using column values.

filter(customer_data, college_degree == "Yes")
## # A tibble: 7,782 × 14
##    customer_id birth_year gender income credit married college_degree region   
##          <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>    
##  1        1004       1984 Other   64000   574. Yes     Yes            Midwest  
##  2        1005       1987 Male    58000   644. No      Yes            West     
##  3        1006       1994 Male   164000   554. Yes     Yes            Midwest  
##  4        1013       1974 Female 197000   680. Yes     Yes            West     
##  5        1015       1995 Male   347000   658. No      Yes            West     
##  6        1016       1984 Other  292000   626. Yes     Yes            Northeast
##  7        1017       1997 Male    68000   588. No      Yes            Midwest  
##  8        1018       1984 Male   184000   666. No      Yes            Northeast
##  9        1019       1984 Female  60000   758. No      Yes            West     
## 10        1021       1997 Female  20000   503. No      Yes            West     
## # ℹ 7,772 more rows
## # ℹ 6 more variables: state <chr>, review_id <dbl>, star_rating <dbl>,
## #   review_time <chr>, review_title <chr>, review_text <chr>

filter(customer_data, region != "West")
## # A tibble: 5,436 × 14
##    customer_id birth_year gender income credit married college_degree region   
##          <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>    
##  1        1001       1971 Female  73000   742. No      No             South    
##  2        1003       1988 Male    35000   542. No      No             South    
##  3        1004       1984 Other   64000   574. Yes     Yes            Midwest  
##  4        1006       1994 Male   164000   554. Yes     Yes            Midwest  
##  5        1007       1968 Male    39000   608. No      No             Midwest  
##  6        1008       1994 Male    69000   710. No      No             South    
##  7        1010       1994 Female  77000   605. Yes     No             Northeast
##  8        1012       1953 Female 126000   673. Yes     No             South    
##  9        1014       1997 Male    49000   666. Yes     No             Northeast
## 10        1016       1984 Other  292000   626. Yes     Yes            Northeast
## # ℹ 5,426 more rows
## # ℹ 6 more variables: state <chr>, review_id <dbl>, star_rating <dbl>,
## #   review_time <chr>, review_title <chr>, review_text <chr>

How would we filter by gender == "Female" and income > 70000?

Why are we putting quotes around "Female" but not gender?

filter(customer_data, gender == "Female", income > 70000)
## # A tibble: 3,970 × 14
##    customer_id birth_year gender income credit married college_degree region   
##          <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>    
##  1        1001       1971 Female  73000   742. No      No             South    
##  2        1010       1994 Female  77000   605. Yes     No             Northeast
##  3        1012       1953 Female 126000   673. Yes     No             South    
##  4        1013       1974 Female 197000   680. Yes     Yes            West     
##  5        1022       1979 Female 155000   805. No      Yes            West     
##  6        1023       1995 Female 137000   539. No      Yes            Northeast
##  7        1024       1974 Female 285000   685. Yes     Yes            Midwest  
##  8        1028       1980 Female  87000   715. No      No             West     
##  9        1030       1969 Female 163000   636. Yes     Yes            West     
## 10        1036       1978 Female 227000   614. Yes     Yes            Northeast
## # ℹ 3,960 more rows
## # ℹ 6 more variables: state <chr>, review_id <dbl>, star_rating <dbl>,
## #   review_time <chr>, review_title <chr>, review_text <chr>

Slice Observations

Sometimes we want to slice our data to keep certain observations using their position in the data.

slice(customer_data, 1:5)
## # A tibble: 5 × 14
##   customer_id birth_year gender income credit married college_degree region 
##         <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>  
## 1        1001       1971 Female  73000   742. No      No             South  
## 2        1002       1970 Female  31000   749. Yes     No             West   
## 3        1003       1988 Male    35000   542. No      No             South  
## 4        1004       1984 Other   64000   574. Yes     Yes            Midwest
## 5        1005       1987 Male    58000   644. No      Yes            West   
## # ℹ 6 more variables: state <chr>, review_id <dbl>, star_rating <dbl>,
## #   review_time <chr>, review_title <chr>, review_text <chr>

Arrange Observations

We can arrange observations to reveal helpful information and check data.

arrange(customer_data, birth_year)
## # A tibble: 10,531 × 14
##    customer_id birth_year gender income credit married college_degree region   
##          <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>    
##  1        9585       1939 Female 182000   850  Yes     No             Northeast
##  2        2409       1947 Male   227000   807. Yes     Yes            Midwest  
##  3        4777       1947 Male   169000   651. Yes     Yes            Northeast
##  4        5165       1947 Female 258000   850  No      Yes            Northeast
##  5        8277       1947 Male    92000   839. Yes     No             West     
##  6        8525       1947 Female 170000   723. Yes     Yes            West     
##  7        9437       1947 Female  95000   850  No      Yes            West     
##  8       10069       1947 Other   84000   743. Yes     Yes            West     
##  9        4292       1948 Female 169000   760. Yes     No             Northeast
## 10        5282       1948 Male   176000   705. Yes     No             South    
## # ℹ 10,521 more rows
## # ℹ 6 more variables: state <chr>, review_id <dbl>, star_rating <dbl>,
## #   review_time <chr>, review_title <chr>, review_text <chr>

arrange(customer_data, desc(birth_year))
## # A tibble: 10,531 × 14
##    customer_id birth_year gender income credit married college_degree region   
##          <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>    
##  1        1026       1999 Male    66000   643. No      No             South    
##  2        1049       1999 Other   88000   630. No      Yes            West     
##  3        1092       1999 Other   77000   664. No      Yes            Midwest  
##  4        1107       1999 Female  97000   579. No      Yes            West     
##  5        1113       1999 Male   190000   661. Yes     Yes            Northeast
##  6        1126       1999 Female 121000   639. No      Yes            Midwest  
##  7        1132       1999 Male    53000   669. No      No             West     
##  8        1139       1999 Female 293000   659. Yes     Yes            Northeast
##  9        1143       1999 Female  74000   587. No      Yes            West     
## 10        1147       1999 Other  109000   429. Yes     No             West     
## # ℹ 10,521 more rows
## # ℹ 6 more variables: state <chr>, review_id <dbl>, star_rating <dbl>,
## #   review_time <chr>, review_title <chr>, review_text <chr>

Select Variables

Sometimes we only care to keep certain variables, especially when working with a large dataset.

select(customer_data, region:review_text)
## # A tibble: 10,531 × 7
##    region    state review_id star_rating review_time review_title    review_text
##    <chr>     <chr>     <dbl>       <dbl> <chr>       <chr>           <chr>      
##  1 South     DC       933551           4 06 11, 2015 Four Stars      everything…
##  2 West      WA           NA          NA <NA>        <NA>            <NA>       
##  3 South     AR           NA          NA <NA>        <NA>            <NA>       
##  4 Midwest   MN           NA          NA <NA>        <NA>            <NA>       
##  5 West      HI       501318           5 03 25, 2008 Great Product!! I looked a…
##  6 Midwest   MN       109125           2 06 7, 2013  Not at all fla… I ordered …
##  7 Midwest   MN           NA          NA <NA>        <NA>            <NA>       
##  8 South     KY      1959683           5 04 20, 2016 Super comforta… Super comf…
##  9 West      NM           NA          NA <NA>        <NA>            <NA>       
## 10 Northeast VT      1907775           5 10 18, 2015 nice strong en… Yeah       
## # ℹ 10,521 more rows

Mutate Variables

We can also recode existing variables or create new variables.

mutate(customer_data, income = income / 1000)
## # A tibble: 10,531 × 14
##    customer_id birth_year gender income credit married college_degree region   
##          <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>    
##  1        1001       1971 Female     73   742. No      No             South    
##  2        1002       1970 Female     31   749. Yes     No             West     
##  3        1003       1988 Male       35   542. No      No             South    
##  4        1004       1984 Other      64   574. Yes     Yes            Midwest  
##  5        1005       1987 Male       58   644. No      Yes            West     
##  6        1006       1994 Male      164   554. Yes     Yes            Midwest  
##  7        1007       1968 Male       39   608. No      No             Midwest  
##  8        1008       1994 Male       69   710. No      No             South    
##  9        1009       1958 Male      233   702. No      No             West     
## 10        1010       1994 Female     77   605. Yes     No             Northeast
## # ℹ 10,521 more rows
## # ℹ 6 more variables: state <chr>, review_id <dbl>, star_rating <dbl>,
## #   review_time <chr>, review_title <chr>, review_text <chr>

Two important things to remember:

  1. We can overwrite variables in a data frame as well as objects if we use the same name.
  2. Good variable and object names only use lowercase letters, numbers, and _. For example: good_name2 and 2bad name. Bad names should be renamed (but can be referenced if surrounded with " ").

Join Data Frames

A common variable (e.g., an ID) allows us to join two data frames.

store_transactions <- read_csv("store_transactions.csv")
## Rows: 10531 Columns: 169
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (169): customer_id, jan_2005, feb_2005, mar_2005, apr_2005, may_2005, ju...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(store_transactions)
## Rows: 10,531
## Columns: 169
## $ customer_id <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010…
## $ jan_2005    <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0…
## $ feb_2005    <dbl> 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 5, 0, 4, 0…
## $ mar_2005    <dbl> 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 5, 0…
## $ apr_2005    <dbl> 4, 0, 0, 0, 0, 0, 4, 0, 0, 2, 0, 0, 0, 4, 0, 0, 0, 1, 0, 0…
## $ may_2005    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ jun_2005    <dbl> 4, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0…
## $ jul_2005    <dbl> 0, 0, 0, 0, 0, 0, 0, 2, 4, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ aug_2005    <dbl> 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 2, 0, 0, 4, 0, 0, 0, 0, 0, 0…
## $ sep_2005    <dbl> 0, 0, 0, 0, 0, 4, 0, 5, 4, 0, 0, 0, 0, 2, 0, 5, 0, 0, 0, 0…
## $ oct_2005    <dbl> 4, 0, 0, 0, 3, 0, 5, 2, 0, 0, 0, 1, 0, 3, 0, 0, 0, 0, 0, 4…
## $ nov_2005    <dbl> 0, 2, 0, 0, 4, 0, 0, 0, 0, 0, 5, 3, 0, 2, 2, 0, 2, 0, 5, 0…
## $ dec_2005    <dbl> 3, 0, 1, 0, 0, 2, 1, 2, 5, 0, 0, 5, 0, 2, 4, 0, 2, 2, 3, 0…
## $ jan_2006    <dbl> 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 5, 3, 0, 4, 2, 3, 0, 0, 4, 4…
## $ feb_2006    <dbl> 2, 0, 0, 0, 0, 4, 0, 0, 0, 0, 5, 0, 4, 0, 0, 0, 0, 0, 0, 0…
## $ mar_2006    <dbl> 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ apr_2006    <dbl> 0, 0, 0, 0, 0, 0, 4, 3, 0, 0, 0, 0, 0, 0, 0, 3, 0, 2, 5, 3…
## $ may_2006    <dbl> 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 4, 3, 0, 0, 0, 0, 0, 0, 0, 0…
## $ jun_2006    <dbl> 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 2, 4, 0, 5…
## $ jul_2006    <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 4…
## $ aug_2006    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ sep_2006    <dbl> 0, 0, 0, 0, 5, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0…
## $ oct_2006    <dbl> 0, 2, 1, 3, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ nov_2006    <dbl> 0, 0, 0, 4, 1, 1, 0, 0, 0, 3, 3, 3, 5, 1, 0, 0, 0, 0, 0, 0…
## $ dec_2006    <dbl> 5, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 0, 3, 5, 3, 0, 0, 0, 0, 3…
## $ jan_2007    <dbl> 0, 0, 3, 0, 0, 0, 0, 0, 2, 0, 0, 0, 4, 4, 0, 0, 0, 0, 4, 3…
## $ feb_2007    <dbl> 2, 0, 0, 0, 2, 0, 0, 3, 4, 0, 2, 0, 0, 0, 0, 0, 0, 0, 2, 0…
## $ mar_2007    <dbl> 4, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 0, 0, 0, 3, 0…
## $ apr_2007    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 3, 0, 3, 0, 1…
## $ may_2007    <dbl> 0, 0, 0, 3, 0, 0, 0, 3, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ jun_2007    <dbl> 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 1, 3, 1, 5…
## $ jul_2007    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 0, 0, 2, 0, 0, 0, 0, 2…
## $ aug_2007    <dbl> 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ sep_2007    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 4, 3…
## $ oct_2007    <dbl> 3, 3, 0, 0, 0, 3, 0, 0, 0, 0, 4, 3, 0, 0, 0, 0, 0, 5, 0, 0…
## $ nov_2007    <dbl> 0, 2, 0, 5, 0, 5, 0, 0, 0, 5, 1, 0, 4, 0, 0, 2, 2, 3, 0, 3…
## $ dec_2007    <dbl> 0, 4, 0, 0, 4, 0, 0, 0, 0, 0, 0, 4, 0, 1, 4, 0, 4, 0, 3, 0…
## $ jan_2008    <dbl> 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 2, 0, 0, 4, 0, 0, 0, 0…
## $ feb_2008    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ mar_2008    <dbl> 5, 5, 0, 0, 0, 0, 3, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ apr_2008    <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 4, 0, 0, 1, 1…
## $ may_2008    <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0…
## $ jun_2008    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 2, 2, 0, 0…
## $ jul_2008    <dbl> 0, 0, 0, 0, 2, 2, 0, 0, 0, 1, 2, 0, 0, 4, 0, 2, 4, 0, 2, 0…
## $ aug_2008    <dbl> 0, 0, 0, 0, 0, 0, 5, 3, 0, 0, 5, 0, 2, 0, 4, 0, 3, 0, 1, 0…
## $ sep_2008    <dbl> 4, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 4, 0, 0, 5, 0, 0, 0, 2…
## $ oct_2008    <dbl> 0, 4, 0, 0, 0, 0, 0, 0, 5, 4, 2, 0, 3, 0, 0, 3, 3, 0, 0, 4…
## $ nov_2008    <dbl> 1, 0, 0, 2, 2, 4, 0, 0, 0, 0, 2, 0, 0, 0, 0, 4, 0, 2, 4, 0…
## $ dec_2008    <dbl> 0, 0, 0, 0, 2, 3, 1, 0, 0, 0, 2, 2, 0, 4, 4, 0, 1, 0, 5, 1…
## $ jan_2009    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ feb_2009    <dbl> 2, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0…
## $ mar_2009    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0…
## $ apr_2009    <dbl> 3, 0, 0, 1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 4, 0, 0, 0, 0, 2, 5…
## $ may_2009    <dbl> 0, 0, 0, 0, 2, 4, 0, 4, 0, 0, 0, 0, 0, 5, 4, 2, 0, 2, 1, 0…
## $ jun_2009    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ jul_2009    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 2, 3, 0, 1, 5, 3…
## $ aug_2009    <dbl> 0, 4, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3…
## $ sep_2009    <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 3, 2, 0, 0…
## $ oct_2009    <dbl> 5, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0…
## $ nov_2009    <dbl> 5, 0, 0, 0, 0, 1, 0, 2, 2, 0, 4, 2, 0, 3, 4, 0, 4, 1, 2, 4…
## $ dec_2009    <dbl> 0, 0, 0, 3, 3, 0, 0, 4, 1, 4, 2, 2, 0, 0, 3, 0, 0, 0, 5, 1…
## $ jan_2010    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 5…
## $ feb_2010    <dbl> 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0…
## $ mar_2010    <dbl> 4, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0…
## $ apr_2010    <dbl> 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 3, 0, 2, 0…
## $ may_2010    <dbl> 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0…
## $ jun_2010    <dbl> 4, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 3…
## $ jul_2010    <dbl> 0, 0, 4, 0, 0, 0, 5, 5, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 4, 0…
## $ aug_2010    <dbl> 0, 2, 0, 0, 3, 0, 0, 0, 0, 0, 2, 0, 4, 3, 0, 0, 0, 0, 0, 0…
## $ sep_2010    <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 3, 0, 2, 0, 0, 0, 0, 0, 0, 4, 0, 3…
## $ oct_2010    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 2, 0, 0, 4, 0, 0, 0, 0, 2, 0…
## $ nov_2010    <dbl> 0, 0, 1, 2, 3, 5, 4, 1, 0, 4, 0, 3, 5, 0, 0, 0, 0, 3, 3, 0…
## $ dec_2010    <dbl> 0, 0, 5, 3, 0, 0, 5, 0, 0, 0, 0, 0, 4, 4, 0, 5, 3, 0, 0, 5…
## $ jan_2011    <dbl> 2, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 5, 2, 1…
## $ feb_2011    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 2, 0, 0, 0, 0…
## $ mar_2011    <dbl> 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 2, 0, 0, 0, 4…
## $ apr_2011    <dbl> 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 3, 0…
## $ may_2011    <dbl> 0, 0, 0, 0, 0, 0, 0, 2, 0, 4, 0, 0, 0, 1, 0, 0, 0, 0, 0, 4…
## $ jun_2011    <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 3, 0, 0, 0, 1, 0…
## $ jul_2011    <dbl> 4, 0, 0, 0, 5, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0…
## $ aug_2011    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 1, 2…
## $ sep_2011    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ oct_2011    <dbl> 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ nov_2011    <dbl> 5, 0, 0, 1, 0, 0, 5, 1, 2, 0, 0, 0, 4, 0, 5, 0, 0, 2, 3, 2…
## $ dec_2011    <dbl> 1, 4, 0, 4, 1, 5, 0, 4, 0, 0, 2, 0, 0, 0, 0, 2, 2, 4, 3, 4…
## $ jan_2012    <dbl> 3, 0, 0, 0, 0, 3, 0, 0, 4, 4, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0…
## $ feb_2012    <dbl> 4, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 3, 0, 0, 5, 1, 0, 0, 0, 0…
## $ mar_2012    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 2, 0, 0, 0, 3, 0, 0…
## $ apr_2012    <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0…
## $ may_2012    <dbl> 4, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 2…
## $ jun_2012    <dbl> 0, 0, 0, 0, 2, 0, 0, 0, 3, 3, 0, 0, 0, 0, 3, 0, 5, 0, 2, 0…
## $ jul_2012    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0…
## $ aug_2012    <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 2, 3, 0, 0, 0, 0, 1, 0, 0, 4, 2, 0…
## $ sep_2012    <dbl> 0, 5, 1, 4, 0, 0, 0, 2, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ oct_2012    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 4, 0, 0, 0, 0, 0, 0…
## $ nov_2012    <dbl> 3, 2, 0, 0, 0, 0, 0, 5, 4, 4, 2, 0, 0, 5, 0, 2, 3, 0, 1, 5…
## $ dec_2012    <dbl> 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 4, 0, 4, 4, 1, 4, 0, 3, 4, 0…
## $ jan_2013    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 3, 0, 0, 0, 4, 4…
## $ feb_2013    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 2, 0, 0, 0, 0…
## $ mar_2013    <dbl> 0, 4, 0, 3, 4, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0…
## $ apr_2013    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0…
## $ may_2013    <dbl> 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ jun_2013    <dbl> 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 3, 5, 0, 0, 0…
## $ jul_2013    <dbl> 0, 0, 4, 0, 4, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 4, 0…
## $ aug_2013    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ sep_2013    <dbl> 1, 0, 0, 2, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ oct_2013    <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ nov_2013    <dbl> 0, 0, 0, 0, 4, 0, 2, 1, 4, 2, 3, 0, 0, 3, 0, 0, 2, 0, 3, 3…
## $ dec_2013    <dbl> 0, 4, 4, 0, 4, 2, 0, 0, 4, 0, 0, 3, 2, 3, 3, 2, 0, 3, 2, 2…
## $ jan_2014    <dbl> 3, 0, 1, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0…
## $ feb_2014    <dbl> 0, 2, 0, 0, 0, 1, 0, 4, 0, 5, 0, 0, 5, 1, 0, 0, 0, 3, 1, 0…
## $ mar_2014    <dbl> 1, 0, 0, 0, 0, 0, 2, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 4, 2…
## $ apr_2014    <dbl> 0, 3, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 3, 0, 0, 0, 2, 0, 0, 0…
## $ may_2014    <dbl> 0, 3, 0, 0, 0, 0, 0, 0, 0, 4, 4, 0, 3, 2, 0, 0, 0, 0, 0, 2…
## $ jun_2014    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 5, 3, 0…
## $ jul_2014    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 4, 0, 0…
## $ aug_2014    <dbl> 2, 3, 0, 0, 0, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0…
## $ sep_2014    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 2, 5, 0, 1, 4, 0…
## $ oct_2014    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 3, 0, 0, 0, 0, 2…
## $ nov_2014    <dbl> 4, 2, 3, 4, 0, 3, 3, 0, 0, 0, 0, 0, 4, 4, 0, 4, 0, 0, 1, 3…
## $ dec_2014    <dbl> 3, 0, 4, 0, 0, 1, 0, 0, 0, 3, 4, 0, 3, 0, 0, 0, 0, 2, 0, 5…
## $ jan_2015    <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ feb_2015    <dbl> 0, 0, 0, 0, 4, 0, 5, 0, 0, 0, 4, 0, 0, 0, 0, 3, 0, 0, 0, 0…
## $ mar_2015    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 0…
## $ apr_2015    <dbl> 0, 0, 0, 0, 3, 2, 0, 3, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0…
## $ may_2015    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 4…
## $ jun_2015    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 5, 4, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0…
## $ jul_2015    <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 3, 4, 5…
## $ aug_2015    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 5…
## $ sep_2015    <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 4, 3…
## $ oct_2015    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 3, 0, 5, 0, 0, 0, 0…
## $ nov_2015    <dbl> 3, 0, 0, 0, 1, 4, 0, 3, 4, 3, 0, 0, 1, 4, 0, 0, 0, 0, 4, 4…
## $ dec_2015    <dbl> 0, 0, 0, 1, 0, 0, 0, 5, 0, 4, 0, 0, 0, 3, 3, 0, 0, 0, 3, 3…
## $ jan_2016    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 3…
## $ feb_2016    <dbl> 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ mar_2016    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3, 0, 1, 0, 0, 0, 3, 0…
## $ apr_2016    <dbl> 3, 0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 2…
## $ may_2016    <dbl> 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 3…
## $ jun_2016    <dbl> 0, 0, 0, 0, 0, 3, 0, 0, 0, 5, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0…
## $ jul_2016    <dbl> 4, 0, 4, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 4, 3, 0, 0, 0, 0, 0…
## $ aug_2016    <dbl> 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 4, 0, 0, 2, 0, 0, 3, 0…
## $ sep_2016    <dbl> 0, 0, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 3, 0…
## $ oct_2016    <dbl> 3, 0, 0, 0, 5, 0, 0, 0, 0, 0, 5, 0, 0, 0, 4, 0, 0, 4, 1, 0…
## $ nov_2016    <dbl> 5, 0, 0, 4, 2, 3, 3, 0, 0, 3, 0, 0, 0, 4, 0, 2, 0, 2, 0, 0…
## $ dec_2016    <dbl> 2, 1, 0, 0, 0, 0, 1, 5, 0, 0, 0, 1, 3, 0, 4, 2, 2, 0, 2, 0…
## $ jan_2017    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 4, 0…
## $ feb_2017    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 0, 0, 0, 0, 0, 0…
## $ mar_2017    <dbl> 0, 1, 0, 0, 0, 1, 0, 2, 3, 0, 0, 0, 0, 4, 0, 5, 0, 0, 0, 4…
## $ apr_2017    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0…
## $ may_2017    <dbl> 0, 0, 0, 5, 0, 0, 0, 0, 0, 1, 2, 0, 0, 2, 0, 3, 3, 0, 4, 3…
## $ jun_2017    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 4, 0, 0, 2, 0, 0…
## $ jul_2017    <dbl> 3, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 3, 3, 0, 0, 0, 2…
## $ aug_2017    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 2, 0, 0, 0, 1, 1, 0, 0…
## $ sep_2017    <dbl> 2, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 5, 4, 0, 0, 0, 0, 0…
## $ oct_2017    <dbl> 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 4, 0…
## $ nov_2017    <dbl> 0, 2, 5, 0, 0, 1, 4, 0, 2, 0, 5, 0, 2, 5, 0, 0, 3, 0, 4, 0…
## $ dec_2017    <dbl> 0, 0, 4, 0, 0, 0, 0, 5, 5, 5, 4, 4, 0, 0, 2, 4, 0, 3, 0, 2…
## $ jan_2018    <dbl> 1, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 2, 0, 0, 0, 0, 3, 1…
## $ feb_2018    <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ mar_2018    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 2, 0…
## $ apr_2018    <dbl> 0, 0, 0, 2, 0, 0, 4, 0, 0, 0, 0, 1, 0, 0, 0, 2, 1, 4, 0, 0…
## $ may_2018    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ jun_2018    <dbl> 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 2, 0, 0, 0, 0, 3…
## $ jul_2018    <dbl> 0, 0, 0, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0, 0, 0, 2, 0, 0, 0, 5…
## $ aug_2018    <dbl> 0, 0, 0, 0, 0, 0, 4, 0, 0, 3, 0, 0, 0, 0, 0, 4, 0, 0, 0, 5…
## $ sep_2018    <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 5, 0, 2, 0, 0, 0, 0, 2, 4, 0…
## $ oct_2018    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ nov_2018    <dbl> 0, 0, 4, 0, 3, 0, 0, 2, 0, 4, 1, 0, 1, 0, 0, 0, 0, 0, 5, 0…
## $ dec_2018    <dbl> 0, 5, 0, 0, 2, 0, 0, 4, 0, 0, 0, 4, 0, 4, 1, 0, 0, 0, 2, 0…

A left join keeps all rows in the “left” data frame plus all columns from both data frames. A left join is a good “default” join to start with.

left_join(customer_data, store_transactions, join_by(customer_id))
## # A tibble: 10,531 × 182
##    customer_id birth_year gender income credit married college_degree region   
##          <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>    
##  1        1001       1971 Female  73000   742. No      No             South    
##  2        1002       1970 Female  31000   749. Yes     No             West     
##  3        1003       1988 Male    35000   542. No      No             South    
##  4        1004       1984 Other   64000   574. Yes     Yes            Midwest  
##  5        1005       1987 Male    58000   644. No      Yes            West     
##  6        1006       1994 Male   164000   554. Yes     Yes            Midwest  
##  7        1007       1968 Male    39000   608. No      No             Midwest  
##  8        1008       1994 Male    69000   710. No      No             South    
##  9        1009       1958 Male   233000   702. No      No             West     
## 10        1010       1994 Female  77000   605. Yes     No             Northeast
## # ℹ 10,521 more rows
## # ℹ 174 more variables: state <chr>, review_id <dbl>, star_rating <dbl>,
## #   review_time <chr>, review_title <chr>, review_text <chr>, jan_2005 <dbl>,
## #   feb_2005 <dbl>, mar_2005 <dbl>, apr_2005 <dbl>, may_2005 <dbl>,
## #   jun_2005 <dbl>, jul_2005 <dbl>, aug_2005 <dbl>, sep_2005 <dbl>,
## #   oct_2005 <dbl>, nov_2005 <dbl>, dec_2005 <dbl>, jan_2006 <dbl>,
## #   feb_2006 <dbl>, mar_2006 <dbl>, apr_2006 <dbl>, may_2006 <dbl>, …

An inner join keeps rows that have matching IDs along with all columns from both data frames.

inner_join(customer_data, store_transactions, join_by(customer_id))
## # A tibble: 10,531 × 182
##    customer_id birth_year gender income credit married college_degree region   
##          <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr>    
##  1        1001       1971 Female  73000   742. No      No             South    
##  2        1002       1970 Female  31000   749. Yes     No             West     
##  3        1003       1988 Male    35000   542. No      No             South    
##  4        1004       1984 Other   64000   574. Yes     Yes            Midwest  
##  5        1005       1987 Male    58000   644. No      Yes            West     
##  6        1006       1994 Male   164000   554. Yes     Yes            Midwest  
##  7        1007       1968 Male    39000   608. No      No             Midwest  
##  8        1008       1994 Male    69000   710. No      No             South    
##  9        1009       1958 Male   233000   702. No      No             West     
## 10        1010       1994 Female  77000   605. Yes     No             Northeast
## # ℹ 10,521 more rows
## # ℹ 174 more variables: state <chr>, review_id <dbl>, star_rating <dbl>,
## #   review_time <chr>, review_title <chr>, review_text <chr>, jan_2005 <dbl>,
## #   feb_2005 <dbl>, mar_2005 <dbl>, apr_2005 <dbl>, may_2005 <dbl>,
## #   jun_2005 <dbl>, jul_2005 <dbl>, aug_2005 <dbl>, sep_2005 <dbl>,
## #   oct_2005 <dbl>, nov_2005 <dbl>, dec_2005 <dbl>, jan_2006 <dbl>,
## #   feb_2006 <dbl>, mar_2006 <dbl>, apr_2006 <dbl>, may_2006 <dbl>, …

Consecutive Function Calls

We typically want to perform many consecutive function calls. You might be tempted to do the following. Don’t do this!

crm_data_2 <- left_join(customer_data, store_transactions, join_by(customer_id))
crm_data_3 <- filter(crm_data_2, region == "West", feb_2005 == max(feb_2005))
crm_data_4 <- mutate(crm_data_3, age = 2022 - birth_year)
crm_data_5 <- select(crm_data_4, age, feb_2005)
crm_data_6 <- arrange(crm_data_5, desc(age))
crm_data_7 <- slice(crm_data_6, 1)

crm_data_7
## # A tibble: 1 × 2
##     age feb_2005
##   <dbl>    <dbl>
## 1    65        5

This would be another way run the same code. Don’t do this either!

slice(
  arrange(
    select(
      mutate(
        filter(
          left_join(
            customer_data, store_transactions, join_by(customer_id)), 
          region == "West", feb_2005 == max(feb_2005)
        ), 
      age = 2022 - birth_year), 
    age, feb_2005), 
  desc(age)), 
1)
## # A tibble: 1 × 2
##     age feb_2005
##   <dbl>    <dbl>
## 1    65        5

Consecutive Lines of Code

Part of the common philosophy for the tidyverse is that:

  • Each function should do one specific thing well.
  • Each function should have a data frame as an input and a data frame as an output.

This allows us to |> together functions in consecutive lines of code so that it is easier for humans to read and less error-prone.

filter(customer_data, credit == 850, state == "WY")
## # A tibble: 1 × 14
##   customer_id birth_year gender income credit married college_degree region
##         <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr> 
## 1        1335       1978 Female 112000    850 No      No             West  
## # ℹ 6 more variables: state <chr>, review_id <dbl>, star_rating <dbl>,
## #   review_time <chr>, review_title <chr>, review_text <chr>
customer_data |> 
  filter(.data = _, credit == 850, state == "WY")
## # A tibble: 1 × 14
##   customer_id birth_year gender income credit married college_degree region
##         <dbl>      <dbl> <chr>   <dbl>  <dbl> <chr>   <chr>          <chr> 
## 1        1335       1978 Female 112000    850 No      No             West  
## # ℹ 6 more variables: state <chr>, review_id <dbl>, star_rating <dbl>,
## #   review_time <chr>, review_title <chr>, review_text <chr>

  • So what does the |> do?
  • What’s the deal with the indented line after the |>?
  • Let’s rewrite our terrible code using the |>.

Read |> as then. (If we need to use <-, we typically put it at the beginning.)

customer_data |> 
  left_join(store_transactions, join_by(customer_id)) |> 
  filter(region == "West", feb_2005 == max(feb_2005)) |> 
  mutate(age = 2023 - birth_year) |> 
  select(age, feb_2005) |> 
  arrange(desc(age)) |> 
  slice(1)
## # A tibble: 1 × 2
##     age feb_2005
##   <dbl>    <dbl>
## 1    66        5

Wrapping Up

Summary

  • Covered the basics of coding in R.
  • Practiced some essential functions for transforming data with {dplyr}.

Next Time

  • Summarizing discrete data with {dplyr}.
  • Visualizing discrete data with {ggplot2}.

Supplementary Material

  • R for Data Science (2e) Chapters 3, 4, and 5

Artwork by @allison_horst

Exercise 2

In RStudio on Posit Cloud, create a new R script, load the tidyverse, and import the data. Then, using the |> in consecutive lines of code, find the customers who have spent the most recently by doing the following.

  1. Join the customer_data and store_transactions data.
  2. Only keep customers in the South.
  3. Create a new variable age using 2023 - birth_year.
  4. Only keep the variables age, gender, income, credit, married, college_degree, region, and dec_2018.
  5. Arrange the data in descending order based on dec_2018 transactions.
  6. Keep the top 3 rows of data.
  7. Who appears to be purchasing the most items for this month in the South?
  8. Export the R script and upload to Canvas.